dplyr-verbs, Piping & ExplorationWelcome to the follow-up to the “Intro to R”! If you see this you have managed to download the files from GitHub, which already makes you smarter than tolly500:
This means now it’s…
If you haven’t already:
R or
r-base)Install RStudio from the official website (please)
Install packages:
install.packages(c("tidyverse", "gapminder", "here", "markdown", "rmarkdown"))
and hit Enter.library(). So
please, don’t include install.packages() in your code!
Ever!tidyverse and “tidy” dataIn this workshop, we will use the tidyverse, which is a collection of R packages designed to make data manipulation and analysis as easy and coherent as possible. The tools are based around the concept of “tidy data”:
Artworks by @allison_horst
Today you will learn about the tools you can use with tidy data (including visualizations), and tomorrow you will learn how to (1) keep your projects tidy as well, and (2) turn messy into tidy data.
For our examples, we will use data from Gapminder for our examples, which is provided as an R-package:
dplyr-verbsdplyr is a “grammar” for data manipulation & the
core of the tidyverse. It follows the logic of “verbs”:
select() allows you to pick variables by namefilter() picks cases based on their valuesmutate() change existing or add new variablesarrange() changes the ordering of rowssummarise() reduces multiple values down to a single
valueClick here to get to the
dplyr-documentation, which is really comprehensive with
many examples!
Selecting:
If you look at ?select, you will see that it has two
arguments:
.data, which is the data frame from which to select
variables, and..., which variables to select.For example:
select(.data = gapminder, country, year, lifeExp)
#> # A tibble: 1,704 × 3
#> country year lifeExp
#> <fct> <int> <dbl>
#> 1 Afghanistan 1952 28.8
#> 2 Afghanistan 1957 30.3
#> 3 Afghanistan 1962 32.0
#> 4 Afghanistan 1967 34.0
#> 5 Afghanistan 1972 36.1
#> 6 Afghanistan 1977 38.4
#> 7 Afghanistan 1982 39.9
#> 8 Afghanistan 1987 40.8
#> 9 Afghanistan 1992 41.7
#> 10 Afghanistan 1997 41.8
#> # ℹ 1,694 more rowsAnother way of writing the same thing is using the “pipe” operator
(|> or %>%)1 : it takes what is on
its left-hand side and hands it to the function on the right-hand side2. If this
sounds complicated, just read the pipe in your head as “and then…”:
gapminder |> select(country, year, lifeExp)
#> # A tibble: 1,704 × 3
#> country year lifeExp
#> <fct> <int> <dbl>
#> 1 Afghanistan 1952 28.8
#> 2 Afghanistan 1957 30.3
#> 3 Afghanistan 1962 32.0
#> 4 Afghanistan 1967 34.0
#> 5 Afghanistan 1972 36.1
#> 6 Afghanistan 1977 38.4
#> 7 Afghanistan 1982 39.9
#> 8 Afghanistan 1987 40.8
#> 9 Afghanistan 1992 41.7
#> 10 Afghanistan 1997 41.8
#> # ℹ 1,694 more rowsTake gapminder, and then select
the variables country, year and
lifeExp. Some other useful tricks include:
# Dropping columns (negative selection):
gapminder |> select(-continent)
gapminder |> select(-c(continent, pop))
# Selecting columns by name patterns:
gapminder |> select(starts_with("c"))
gapminder |> select(ends_with("p"))
# Select from:to
gapminder |> select(country:lifeExp)You can also select everything(), which selects… well,
everything, kind of. You can use it e.g. to reorder columns (here, it
will select those explicitly called upon first, and then
everything() else):
gapminder |> select(pop, gdpPercap, everything())
#> # A tibble: 1,704 × 6
#> pop gdpPercap country continent year lifeExp
#> <int> <dbl> <fct> <fct> <int> <dbl>
#> 1 8425333 779. Afghanistan Asia 1952 28.8
#> 2 9240934 821. Afghanistan Asia 1957 30.3
#> 3 10267083 853. Afghanistan Asia 1962 32.0
#> 4 11537966 836. Afghanistan Asia 1967 34.0
#> 5 13079460 740. Afghanistan Asia 1972 36.1
#> 6 14880372 786. Afghanistan Asia 1977 38.4
#> 7 12881816 978. Afghanistan Asia 1982 39.9
#> 8 13867957 852. Afghanistan Asia 1987 40.8
#> 9 16317921 649. Afghanistan Asia 1992 41.7
#> 10 22227415 635. Afghanistan Asia 1997 41.8
#> # ℹ 1,694 more rowsFiltering: Subset your data, keeping rows that match a condition. Useful logics:
==), not equal (!=), not
(!), or (|), and (&)is.nabetween() & near()gapminder |> filter(continent == "Asia")
#> # A tibble: 396 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ℹ 386 more rows
gapminder |> filter(year != 2007)
#> # A tibble: 1,562 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ℹ 1,552 more rows
gapminder |>
filter(year == 2007 & continent %in% c("Asia", "Europe") & pop > 1e7)
#> # A tibble: 37 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 2007 43.8 31889923 975.
#> 2 Bangladesh Asia 2007 64.1 150448339 1391.
#> 3 Belgium Europe 2007 79.4 10392226 33693.
#> 4 Cambodia Asia 2007 59.7 14131858 1714.
#> 5 China Asia 2007 73.0 1318683096 4959.
#> 6 Czech Republic Europe 2007 76.5 10228744 22833.
#> 7 France Europe 2007 80.7 61083916 30470.
#> 8 Germany Europe 2007 79.4 82400996 32170.
#> 9 Greece Europe 2007 79.5 10706290 27538.
#> 10 India Asia 2007 64.7 1110396331 2452.
#> # ℹ 27 more rowsNow that we know two functions, we can see why piping is one of the
most powerful ideas in R. Say we want country names, life expectancy
& GDP per capita for all European countries in 2007. We can combine
a filter() and a select()ion into a
pipeline:
gapminder |>
filter(continent == "Europe" & year == 2007) |>
select(country, lifeExp, gdpPercap)
#> # A tibble: 30 × 3
#> country lifeExp gdpPercap
#> <fct> <dbl> <dbl>
#> 1 Albania 76.4 5937.
#> 2 Austria 79.8 36126.
#> 3 Belgium 79.4 33693.
#> 4 Bosnia and Herzegovina 74.9 7446.
#> 5 Bulgaria 73.0 10681.
#> 6 Croatia 75.7 14619.
#> 7 Czech Republic 76.5 22833.
#> 8 Denmark 78.3 35278.
#> 9 Finland 79.3 33207.
#> 10 France 80.7 30470.
#> # ℹ 20 more rowsTake gapminder, and then filter for
continent == "Europe and year == 2007, and
then select the variables country,
lifeExp and gdpPercap. If we want to keep
this, we can then assign the result of our pipeline:
europe <-
gapminder |>
filter(continent == "Europe" & year == 2007) |>
select(country, lifeExp, gdpPercap)Since R supports right-hand assignment, we can write this even more intuitively:
gapminder |>
filter(continent == "Europe" & year == 2007) |>
select(country, lifeExp, gdpPercap) -> europeBe warned though that many people consider right-hand assignment “bad practice” and more of a gimmick (for example it’s hard to see that there is an assignment operation happening here if you just skim the code).
Arranging:
gapminder |> arrange(lifeExp)
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Rwanda Africa 1992 23.6 7290203 737.
#> 2 Afghanistan Asia 1952 28.8 8425333 779.
#> 3 Gambia Africa 1952 30 284320 485.
#> 4 Angola Africa 1952 30.0 4232095 3521.
#> 5 Sierra Leone Africa 1952 30.3 2143249 880.
#> 6 Afghanistan Asia 1957 30.3 9240934 821.
#> 7 Cambodia Asia 1977 31.2 6978607 525.
#> 8 Mozambique Africa 1952 31.3 6446316 469.
#> 9 Sierra Leone Africa 1957 31.6 2295678 1004.
#> 10 Burkina Faso Africa 1952 32.0 4469979 543.
#> # ℹ 1,694 more rows
gapminder |> arrange(desc(pop))
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 China Asia 2007 73.0 1318683096 4959.
#> 2 China Asia 2002 72.0 1280400000 3119.
#> 3 China Asia 1997 70.4 1230075000 2289.
#> 4 China Asia 1992 68.7 1164970000 1656.
#> 5 India Asia 2007 64.7 1110396331 2452.
#> 6 China Asia 1987 67.3 1084035000 1379.
#> 7 India Asia 2002 62.9 1034172547 1747.
#> 8 China Asia 1982 65.5 1000281000 962.
#> 9 India Asia 1997 61.8 959000000 1459.
#> 10 China Asia 1977 64.0 943455000 741.
#> # ℹ 1,694 more rowsAgain, combining what we know to find the countries with the highest life expectancy:
gapminder |>
filter(continent == "Europe" & year == 2007) |>
select(country, lifeExp, gdpPercap) |>
arrange(desc(lifeExp))
#> # A tibble: 30 × 3
#> country lifeExp gdpPercap
#> <fct> <dbl> <dbl>
#> 1 Iceland 81.8 36181.
#> 2 Switzerland 81.7 37506.
#> 3 Spain 80.9 28821.
#> 4 Sweden 80.9 33860.
#> 5 France 80.7 30470.
#> 6 Italy 80.5 28570.
#> 7 Norway 80.2 49357.
#> 8 Austria 79.8 36126.
#> 9 Netherlands 79.8 36798.
#> 10 Greece 79.5 27538.
#> # ℹ 20 more rowsThink for a moment: how do you read this sequence in your head?
Renaming:
gapminder |>
select(country, year, lifeExp, gdpPercap) |>
rename("life_exp" = lifeExp, "gdp_pc" = gdpPercap)
#> # A tibble: 1,704 × 4
#> country year life_exp gdp_pc
#> <fct> <int> <dbl> <dbl>
#> 1 Afghanistan 1952 28.8 779.
#> 2 Afghanistan 1957 30.3 821.
#> 3 Afghanistan 1962 32.0 853.
#> 4 Afghanistan 1967 34.0 836.
#> 5 Afghanistan 1972 36.1 740.
#> 6 Afghanistan 1977 38.4 786.
#> 7 Afghanistan 1982 39.9 978.
#> 8 Afghanistan 1987 40.8 852.
#> 9 Afghanistan 1992 41.7 649.
#> 10 Afghanistan 1997 41.8 635.
#> # ℹ 1,694 more rowsMost of this is self-explanatory right? Here is a tip: you can rename
inside select(), so you rarely need
rename():
gapminder |>
select(country, year, "life_exp" = lifeExp, "gdp_pc" = gdpPercap)
#> # A tibble: 1,704 × 4
#> country year life_exp gdp_pc
#> <fct> <int> <dbl> <dbl>
#> 1 Afghanistan 1952 28.8 779.
#> 2 Afghanistan 1957 30.3 821.
#> 3 Afghanistan 1962 32.0 853.
#> 4 Afghanistan 1967 34.0 836.
#> 5 Afghanistan 1972 36.1 740.
#> 6 Afghanistan 1977 38.4 786.
#> 7 Afghanistan 1982 39.9 978.
#> 8 Afghanistan 1987 40.8 852.
#> 9 Afghanistan 1992 41.7 649.
#> 10 Afghanistan 1997 41.8 635.
#> # ℹ 1,694 more rowsSummarising: Compute summary values
gapminder |>
filter(year == 2007) |>
summarise(mean_life_exp = mean(lifeExp, na.rm = TRUE))
#> # A tibble: 1 × 1
#> mean_life_exp
#> <dbl>
#> 1 67.0You may ask yourself: why not just use mean at this
point3?
summarise only really shines when combined with
group_by() to compute grouped summary statistics or
aggregates:
gapminder |>
filter(year == 2007) |>
group_by(continent) |>
summarise(mean_life_exp = mean(lifeExp, na.rm = TRUE)) |>
arrange(desc(mean_life_exp))
#> # A tibble: 5 × 2
#> continent mean_life_exp
#> <fct> <dbl>
#> 1 Oceania 80.7
#> 2 Europe 77.6
#> 3 Americas 73.6
#> 4 Asia 70.7
#> 5 Africa 54.8Don’t forget to ungroup() your data (or set
.groups = "drop") if you don’t want to do later
computations by groups.
Mutating: Add new or overwrite existing variables:
gapminder |> mutate(pop_million = pop / 1e6)
#> # A tibble: 1,704 × 7
#> country continent year lifeExp pop gdpPercap pop_million
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. 8.43
#> 2 Afghanistan Asia 1957 30.3 9240934 821. 9.24
#> 3 Afghanistan Asia 1962 32.0 10267083 853. 10.3
#> 4 Afghanistan Asia 1967 34.0 11537966 836. 11.5
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 13.1
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 14.9
#> 7 Afghanistan Asia 1982 39.9 12881816 978. 12.9
#> 8 Afghanistan Asia 1987 40.8 13867957 852. 13.9
#> 9 Afghanistan Asia 1992 41.7 16317921 649. 16.3
#> 10 Afghanistan Asia 1997 41.8 22227415 635. 22.2
#> # ℹ 1,694 more rows
gapminder |>
mutate(
pop_million = pop / 1e6,
gdp = gdpPercap * pop
)
#> # A tibble: 1,704 × 8
#> country continent year lifeExp pop gdpPercap pop_million gdp
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. 8.43 6.57e 9
#> 2 Afghanistan Asia 1957 30.3 9240934 821. 9.24 7.59e 9
#> 3 Afghanistan Asia 1962 32.0 10267083 853. 10.3 8.76e 9
#> 4 Afghanistan Asia 1967 34.0 11537966 836. 11.5 9.65e 9
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 13.1 9.68e 9
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 14.9 1.17e10
#> 7 Afghanistan Asia 1982 39.9 12881816 978. 12.9 1.26e10
#> 8 Afghanistan Asia 1987 40.8 13867957 852. 13.9 1.18e10
#> 9 Afghanistan Asia 1992 41.7 16317921 649. 16.3 1.06e10
#> 10 Afghanistan Asia 1997 41.8 22227415 635. 22.2 1.41e10
#> # ℹ 1,694 more rowsOther useful stuff:
# count()
gapminder |>
filter(year == 2007) |>
count(continent) |>
arrange(desc(n))
#> # A tibble: 5 × 2
#> continent n
#> <fct> <int>
#> 1 Africa 52
#> 2 Asia 33
#> 3 Europe 30
#> 4 Americas 25
#> 5 Oceania 2# n()
gapminder |>
filter(year == 2007) |>
group_by(continent) |>
summarise(n_obs = n(), mean_life_exp = mean(lifeExp, na.rm = TRUE))
#> # A tibble: 5 × 3
#> continent n_obs mean_life_exp
#> <fct> <int> <dbl>
#> 1 Africa 52 54.8
#> 2 Americas 25 73.6
#> 3 Asia 33 70.7
#> 4 Europe 30 77.6
#> 5 Oceania 2 80.7For group_by(), if you continue your pipeline after
summarise() or whatever you did by group, you may
need to ungroup() (otherwise everything will be
done by group). If you want to arrange() your grouped data,
you may need to set .by_group = TRUE inside of
arrange().
If there is one thing to remember from this course, it’s the
dplyr-verbs, because they make data manipulation &
exploration really easy. They are building blocks with which you can do
most operations you will ever need for processing your
data.
Exercise Time:
lifeExp)
in 1952? Which were the top 10 in 2007? What changes do you see (in
terms of countries and in terms of life expectancy)?gapminder data set: Total GDP
in billion USD (we can assume that GDP = gdpPercap \(\times\) pop). Bonus:
In 2007, which countries had the highest total GDP?%>% is the original tidyverse-pipe
introduced in the magrittr-package, while |> is the base
R pipe (has only been around since R4.1). On the surface they don’t
really differ, except for a different placeholder (a concept we will
discuss later), but the base R pipe is a lot faster, and it looks cooler
imo (also it looks like a big arrow to the right if you squint a bit,
which conveys its purpose a lot better). Below the surface (for the R
nerds), rlang::expr(f(x)) == rlang::expr(x |> f()) is
TRUE, while
rlang::expr(f(x)) == rlang::expr(x %>% f()) evaluates to
FALSE, because the tidyverse-pipe engages in data masking
(which makes it slower & produces the infamous traceback-vomit if
you get errors in pipelines), while to the R interpreter
x |> f() is literally the exact same as
f(x), so there is no performance overhead. Consider
rlang::expr(x |> f()) even evaluates to
f(x). The %>% pipe is a bit more flexible
in some edge cases, but I would consider most of them bad practices
anyway.↩︎
As the first argument↩︎
Another way to write this would use pull()
to pull a single vector out of a dataframe:
gapminder |> filter(year == 2007) |> pull(lifeExp) |> mean(na.rm = TRUE)↩︎